I’m feeling a bit anxious. I have never coded before (as far as I know) and never used R, Rstudio or GitHub.
Let’s see, if I can manage to complete this course.
What do I expect to learn?
1. I really don’t know!
2. I hope to learn something completely new for me.
https://github.com/EmmiD/IODS-project
1. Data of this study includes basic information about the students of a statistic course, their attitude, different kind of learning types and academic success. Data are available from 166 students. Questions considering different learning types have been combined into combination variables called deep, surface and strategic learning. Other variables are gender, age, attitude and exam points.
lrn14 <- read.csv("learning2014.csv")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
keep_columns <- c("gender","Age","Attitude", "deep", "stra", "surf", "Points")
learning2014 <- select(lrn14, one_of(keep_columns))
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
2. Below you can see the summaries of the variables and a graphical overview of the data. There are more female participants in this data. Age or gender didn’t affect the points. There’s a correlation between points and attitude.
summary(learning2014)
## gender Age Attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
3. and 4. I did choose three explanatory variables (attitude, strategic learning and surface learning) and fit the in a regression model where exam points are the target variable. Summary shown below.
my_model2 <- lm(Points ~ Attitude + stra + surf, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
While strategic and surface learning didn’t have a statistically significant relationship with points, I also tested deep learning and age as explanatory variables. Summary shown below.
my_model2 <- lm(Points ~ Attitude + deep + Age, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude + deep + Age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.206 -3.430 0.204 3.979 10.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.60773 3.38966 4.605 8.32e-06 ***
## Attitude 0.35941 0.05696 6.310 2.56e-09 ***
## deep -0.60275 0.75031 -0.803 0.423
## Age -0.07716 0.05322 -1.450 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared: 0.2043, Adjusted R-squared: 0.1896
## F-statistic: 13.87 on 3 and 162 DF, p-value: 4.305e-08
In a model in which I used attitude and strategic learning as explanatory variables (summary below), attitude has a statistical significance p <0.001 and strategic learning p < 0.1. Multiple R-squared is 0.2048 meaning that about 20 % of the variation in exam points is explained by the variation in attitude and strategic learning.
my_model2 <- lm(Points ~ Attitude + stra, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
I also tested simple regression model (attitude vs points), where p < 0.001. Multiple R-squared is 0.1906 meaning that a bit less that 20 % of the variation in exam points is explained by the variation in attitude only.
qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
my_model <- lm(Points ~ Attitude, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
5. I produced diagnostic plot for simple regression model (attitude vs points).
QQ-plot: both ends of the line curve, errors of the model are not normally distributed
Residuals vs Fitted values: no pattern, errors don’t depend on the explanatory variable
Residuals vs Leverage: no observation has unusually high impact
plot(my_model, which = c(1,2,5))
Here we go again…
#libraries needed
library(dplyr)
library(ggplot2)
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.5
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.0.5
Data of this study are combined from two datasets: one regarding the performance in mathematics and other in Portuguese language. Data is collected from two Portuguese secondary schools. It includes student grades, social and school related features together with alcohol consumption.
pormath <- read.table("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", sep= ",", header=TRUE)
colnames(pormath)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
My personal hypothesis is that male and students who go out with their friends more often use more alcohol than female and those student who stay at home (sex, goout). I also think that student who want to educate higher use less alcohol (higher) and students who has more absences use more alcohol (absences).
gather(pormath) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
#boxplot high use and goout
g2 <- ggplot(pormath, aes(x = high_use, y = goout, col = sex))
g2 + geom_boxplot()
#boxplot high use and absences
g4 <- ggplot(pormath, aes(x = high_use, y = absences, col = sex))
g4 + geom_boxplot()
#crosstable high use and sex
CrossTable(pormath$high_use, pormath$sex)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | pormath$sex
## pormath$high_use | F | M | Row Total |
## -----------------|-----------|-----------|-----------|
## FALSE | 154 | 105 | 259 |
## | 2.244 | 2.500 | |
## | 0.595 | 0.405 | 0.700 |
## | 0.790 | 0.600 | |
## | 0.416 | 0.284 | |
## -----------------|-----------|-----------|-----------|
## TRUE | 41 | 70 | 111 |
## | 5.235 | 5.833 | |
## | 0.369 | 0.631 | 0.300 |
## | 0.210 | 0.400 | |
## | 0.111 | 0.189 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 195 | 175 | 370 |
## | 0.527 | 0.473 | |
## -----------------|-----------|-----------|-----------|
##
##
#crosstable high use and "higher"
CrossTable(pormath$high_use, pormath$higher)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | pormath$higher
## pormath$high_use | no | yes | Row Total |
## -----------------|-----------|-----------|-----------|
## FALSE | 7 | 252 | 259 |
## | 1.575 | 0.071 | |
## | 0.027 | 0.973 | 0.700 |
## | 0.438 | 0.712 | |
## | 0.019 | 0.681 | |
## -----------------|-----------|-----------|-----------|
## TRUE | 9 | 102 | 111 |
## | 3.675 | 0.166 | |
## | 0.081 | 0.919 | 0.300 |
## | 0.562 | 0.288 | |
## | 0.024 | 0.276 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 16 | 354 | 370 |
## | 0.043 | 0.957 | |
## -----------------|-----------|-----------|-----------|
##
##
From the boxplots I can see that students who go often out, are significantly more often high users of alcohol, specially men. Same trend can be seen in absences. In crosstabs high_use and sex, we can see that OR is 2,5 in male compared to female, supporting my hypothesis. The number of student not willing to educate higher is so small, that it is not possible to do predictions according to that information.
m <- glm(high_use ~ goout + absences + sex + higher, data = pormath, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + absences + sex + higher, family = "binomial",
## data = pormath)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7787 -0.8120 -0.5286 0.7990 2.4772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.67317 0.77959 -4.712 2.46e-06 ***
## goout 0.72110 0.12093 5.963 2.48e-09 ***
## absences 0.08279 0.02289 3.617 0.000298 ***
## sexM 0.99134 0.26216 3.781 0.000156 ***
## higheryes -0.48263 0.59294 -0.814 0.415674
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 373.43 on 365 degrees of freedom
## AIC: 383.43
##
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02539593 0.005295764 0.1144689
## goout 2.05669840 1.631954118 2.6247236
## absences 1.08631589 1.040394135 1.1392604
## sexM 2.69484233 1.621899762 4.5434069
## higheryes 0.61716027 0.185899096 1.9580038
Students who go out often, are significantly more often high users of alcohol (OR 2.1, confidence interval 1.6-2.6)
Students who have lots of abcences are also more often high users (OR 1.09, CI 1.05-1.14), but the influence of absences is not that big as the influence of going out.
Male students are significantly more often high users than female (OR 2.7, CI 1.6-4.5)
There is no significant difference in alcohol use between students who want to educate higher or lower (OR 0.6, CI 0.2-2.0)
KESKEN, NOT FINISHED
#libraries needed
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(qqplotr)
## Warning: package 'qqplotr' was built under R version 4.0.5
##
## Attaching package: 'qqplotr'
## The following objects are masked from 'package:ggplot2':
##
## stat_qq_line, StatQqLine
This data called “Boston” are about housing values in suburbs of Boston. In the columns you can find information about the crime rate in town, distances from employment centres, accessibility and socioeconomical differences in town, for example. Boston data frame has 506 rowns and 14 columns. The dimensions and the structure of the data are described below.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Summaries of the variables are shown below.
There are a strong positive correlation between rad and tax, medv and rm, nox and age, indus and nox, and indus and tax. There are a strong negative correlation between indus and dis, nox and dis, age and dis, lstat and medv.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
#I standardized the dataset and printed out summaries of the scaled data
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# I created a categorical variable of the crime rate in the Boston dataset. I used the quantiles as the break points in the categorical variable.
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# I dropped the old crime rate variable from the dataset and added the new categorical value to scaled data.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# I divided the dataset to train and test sets, so that 80% of the data belongs to the train set.I also saved the crime categories from the test set and removed the categorical crime variable from the test data.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
I fit the linear discriminant analysis on the train set using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
Above I saved the crime categories from the test set and then removed the categorical crime variable from the test dataset.
Now I predict the classes with the LDA model on the test data.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 6 1 0
## med_low 9 14 4 0
## med_high 0 15 12 2
## high 0 0 0 24
We can see, that this model was good predicting high crime rate, which it did perfectly. Other classes were not that well predicted. In other classes around 20 % of cases were predicted in wrong class.
# I reloaded the Boston dataset and standardized it.
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# I calculated the distances between observations
dist_eu <- dist(boston_scaled)
dist_man <- dist(Boston, method = 'manhattan')
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
# I ran k-means algorithm on the data.
km <-kmeans(boston_scaled, centers = 3)
pairs(boston_scaled, col = km$cluster)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[1:6], col = km$cluster)
I found the optimal number of clusters 2.
Finally, a small test with the code provided in Super-Bonus - exercise.
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')